BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates to data processing systems. More particularly, this
invention relates to the selection of a performance level to be used by a data
processing system capable of operating at a plurality of different performance levels.

It is known to provide data processing systems capable of operating at a
plurality of different performance levels. Typically, a relatively low performance and
low power consumption performance level will be used when maximum processing
performance is not required whereas when processing intensive operations are being
performed, then a higher performance level will be selected at the expense of
consuming more power.

As an example of the type of processing systems capable of operating at
different performance levels, the processors produced by Intel Corporation
incorporating their SpeedStep technology operate in a high power, high speed mode
as well as one or more lower power low speed modes. Switching between these
performance levels is typically carried out in dependence upon sensed external
parameters, such as whether or not the system is connected to a mains power supply
or a battery power supply.

It is also known to provide more dynamic performance level management
based upon the dynamically determined processing demands placed upon the data
processing system. An example of such an approach is the LongRun software control
of processor clock speed applied in the processors produced by Transmeta. Such
software attempts to reduce the processor clock frequency and accordingly the power
consumed when the processing demands being placed upon the processor are light
and increase the clock frequency to obtain high performance when the processing
demands are greater.

A problem with this approach is that in order to ensure that the power saving
techniques do not interfere with the usability of the system, the software tends to
make safe assumptions regarding the desired performance level and to run the system
at a higher average performance level than is truly required. This wastes power. In
addition, the algorithm used will tend to be suited to some types of processing activity
but not others and this will further render the performance level control inaccurate.

SUMMARY OF THE INVENTION

Viewed from one aspect the present invention provides a method of selecting a
performance level to be used by a data processing apparatus capable of operating at a
plurality of different performance levels, said method comprising the steps of:

calculating a plurality of performance requests using respective ones of a
plurality of performance request calculating algorithms;

combining said plurality of performance requests to form a global
performance request; and

selecting said performance level to be used by said data processing apparatus
from among said plurality of different performance levels in dependence upon said
global performance request.

The invention recognises that improved performance level control can be
achieved by calculating a plurality of performance level requests with a plurality of
performance request calculating algorithms and then combining those separate
performance requests to form a global performance request that is then used to control
the performance level that is to be selected. In this way, different performance
request calculating algorithms can be used each suited to different circumstances and
each requiring less conservative assumptions to be made than if only a single
algorithm was used. The combination of the plural performance requests to form the
global performance request allows moderation between the different performance
requests in a manner that gives improved performance level selection that more
accurately matches the performance level truly needed.

The performance request calculating algorithms are preferably independent of
each other and base their calculations on detected operating characteristics of the
data processing apparatus.

Preferred embodiments of the invention also allow the separate [...] in which
the different performance request calculating algorithms can make their calculations
and does not restrict the frequency or timing of when performance level changes may
be made.

In preferred embodiments of the invention the performance request calculating
algorithms are arranged in a hierarchy with the respective performance requests being
combined in dependence upon a position within the hierarchy of their originating
performance request calculating algorithm.

It will be appreciated that the hierarchy may be fully ordered or partially
ordered. In the case of partial ordering an operator may be provided for combining
performance requests from the same hierarchy level (e.g. a maximum value selector).

[...] requests will be selected in preference to those other requests.

Improved flexibility may be achieved in the way that performance requests
[...] performance requests should be combined with other performance requests.

As preferred examples, the commands can specify that a performance request
should override any performance requests from a less dominant position in the
hierarchy, should be selected in preference to any lower performance level
performance request from a performance calculating algorithm in a less dominant
position or should be ignored.

The combining of performance requests is conveniently and methodically
performed when these are combined starting from a performance request from a least
dominant position working through to a performance request from a most dominant
position.
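
The bottom-up combination described above can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the command names `SET`, `SET_IF_GREATER` and `IGNORE`, the `Request` type and the `combine` function are hypothetical, and performance levels are assumed to be expressed as fractions of peak performance.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Command(Enum):
    """Hypothetical command names for the three behaviours in the text."""
    SET = auto()             # override requests from less dominant positions
    SET_IF_GREATER = auto()  # win only over lower requests from below
    IGNORE = auto()          # contribute nothing to the result


@dataclass
class Request:
    command: Command
    level: float  # assumed to be a fraction of peak performance (0.0 to 1.0)


def combine(requests: List[Request], default: float = 0.0) -> float:
    """Combine requests ordered from least dominant to most dominant."""
    global_request = default
    for req in requests:
        if req.command is Command.SET:
            global_request = req.level
        elif req.command is Command.SET_IF_GREATER:
            global_request = max(global_request, req.level)
        # Command.IGNORE leaves the running result untouched.
    return global_request


stack = [
    Request(Command.SET, 0.40),             # least dominant position
    Request(Command.SET_IF_GREATER, 0.60),  # middle position
    Request(Command.IGNORE, 1.00),          # most dominant, currently silent
]
global_request = combine(stack)
print(global_request)
```

Walking the list from the least dominant entry upwards means a single pass is enough: each more dominant request sees the combined result of everything beneath it.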

The steps of calculating, combining and selecting may be performed in
different parts of the system such as an operating system kernel, firmware of the data
processing apparatus or hardware within the data processing apparatus. Different
ones of the steps of calculating, combining and selecting may be performed in the
same or different places as mentioned.

The performance request calculating algorithms can be responsive to a variety
of different parameters, such as deadline information from a real time operating
system kernel, general information from an operating system kernel or information
from an application program, which may for example provide its own performance
level requests to the performance level managing system via an appropriate
application program interface.

Viewed from another aspect the present invention provides apparatus for
selecting a performance level to be used by a data processing apparatus capable of
operating at a plurality of different performance levels, said apparatus comprising:

calculating logic operable to calculate a plurality of performance requests
using respective ones of a plurality of performance request calculating algorithms;

combining logic operable to combine said plurality of performance requests to
form a global performance request; and

selecting logic operable to select said performance level to be used by said
data processing apparatus from among said plurality of different performance levels in
dependence upon said global performance request.

DYC Ref: P015674USY
ARM Ref: ?262

Viewed from a further aspect the present invention provides a computer
program product bearing a computer program for controlling a computer to select a
performance level to be used by said computer, said computer being capable of
operating at a plurality of different performance levels, said computer program
comprising:

calculating code operable to calculate a plurality of performance requests
using respective ones of a plurality of performance request calculating algorithms;

combining code operable to combine said plurality of performance requests to
form a global performance request; and

selecting code operable to select said performance level to be used by said data
processing apparatus from among said plurality of different performance levels in
dependence upon said global performance request.

The above, and other objects, features and advantages of this invention will be
apparent from the following detailed description of illustrative embodiments which is to
be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 schematically illustrates how a power management system according
to the present technique may be implemented in a data processing system;

Figure 2 schematically illustrates three hierarchical layers of the performance 
setting algorithm according to the present technique. 

Figure 3 illustrates the strategy for setting the processor performance level 

during an interactive episode; 

Figure 4 schematically illustrates execution of a workload on the processor
and calculation of the utilisation-history window for a task A;

Figure 5 schematically illustrates an implementation of the three-layer
hierarchical performance policy stack of Figure 2;

Figure 6 schematically illustrates a work-tracking counter 600 according to the
present technique;

Figure 7 schematically illustrates an apparatus that is capable of providing a
number of different fixed performance-levels in dependence upon workload
characteristics;

Figure 8 is a table that details simulation measurement results for a 
'plaympeg' video player playing a variety of MPEG videos; 

Figure 9 is a table that lists processor performance levels statistics during the 
runs of each workload; 

Figure 10 comprises two graphs of results for playback of two different
MPEG movies entitled 'Legendary' (Figure 10A) and 'Danse de Cable' (Figure 10B);

Figures 11A, B and C schematically illustrate the characteristics of two
different performance-setting policies;

Figures 12A, B and C schematically illustrate simulation results for different
performance-setting algorithms tested on interactive workloads;

Figure 13 schematically illustrates statistics gathered using a time-skew 
correction technique. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Figure 1 schematically illustrates how the power management system may be
implemented in a data processing system. The data processing system comprises a
kernel 100 having standard kernel functional modules including a system calls module
112, a scheduler 114 and a conventional power manager 116. An intelligent energy
manager system 120 is implemented in the kernel and comprises a policy co-ordinator
122, a performance setting control module 124 and an event-tracing module 126. A
user processes layer 130 comprises a system calls module 132, a task management
module 134 and application specific data 136. The user processes layer 130 supplies
information to the kernel 100 via an application-monitoring module 140.

The kernel 100 is the core that provides basic services for other parts of the
operating system. The kernel can be contrasted with the shell, which is the outermost
part of the operating system that interacts with user commands. The code of the
kernel is executed with complete access privileges for physical resources, such as
memory, on its host system. The services of the kernel are requested by other parts of
the operating system or by an application program through a set of program interfaces
known as system calls. Both the user process layer and the kernel have system calls
modules 112, 132. The scheduler 114 determines which programs share the kernel's
processing time and in what order. A supervisor (not shown) within the kernel gives
use of the processor to each process at the scheduled time. The conventional power
manager 116 manages the supply voltage by switching the processor between a
power-conserving sleep mode and a standard awake-mode in dependence upon the
level of processor utilisation.

The intelligent energy manager 120 is responsible for calculating and setting
processor performance targets. Rather than relying only on sleep mode for power
reduction, the intelligent energy manager 120 allows the central processing unit
(CPU) operating voltage and the processor clock frequency to be reduced without
causing the application software to miss process (i.e. task) deadlines. When the CPU
is running at full capacity many processing tasks will be completed in advance of their
deadlines and the processor will idle until the next scheduled task is begun. An
example of a task deadline for a task that produces data is the point at which the
produced data is required by another task. The deadline for an interactive task would
be the perception threshold of the user (50-100 ms). Going at full performance and
then idling is less energy-efficient than completing the task more slowly so that the
deadline is met more exactly. When the processor frequency is reduced the voltage
may be scaled down in order to achieve energy savings. For processors implemented
in complementary metal-oxide semiconductor (CMOS) technology the energy used
for a given workload is proportional to the voltage squared. The policy co-ordinator
[...] multiple performance-setting algorithms, each being appropriate to [...]
performance-setting [...] calculates a target processor performance by prioritising
these results. The event tracing module 126 monitors system events both in the
kernel 100 and in the user process layer 130 and feeds the information gathered to the
performance setting control module 124 and the policy co-ordinator 122.
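
The voltage-squared relation mentioned above can be made concrete with a small worked calculation. This is a sketch only: the two supply voltages are hypothetical values chosen for illustration, and idle and leakage energy are ignored.

```python
def relative_energy(voltage: float, full_voltage: float) -> float:
    """Energy for a fixed workload relative to running it at full voltage.

    Uses the CMOS relation stated in the text: the energy used for a given
    workload is proportional to the supply voltage squared.
    """
    return (voltage / full_voltage) ** 2


FULL_VOLTAGE = 1.5     # hypothetical full-speed supply voltage, in volts
REDUCED_VOLTAGE = 1.2  # hypothetical voltage at a lower clock frequency

saving = 1.0 - relative_energy(REDUCED_VOLTAGE, FULL_VOLTAGE)
print(f"energy saved for the same work: {saving:.0%}")
```

Under these assumed voltages the same work costs roughly two thirds of the full-speed energy, which is why finishing just before the deadline can be preferable to running flat out and then idling.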

In the user processes layer, processing work is monitored via: system call
events 132; processing task events 134 including task switching, task creation and
task exit; and via application-specific data. The intelligent energy manager 120
is implemented as a set of kernel modules and patches that hook into the standard
kernel functional modules and serve to control the speed and voltage levels of the
processor. The way in which the intelligent energy manager 120 is implemented
makes it relatively autonomous from other modules in the kernel 100. This has the
advantage of making the performance setting control mechanism less intrusive to the
host operating system. Implementation in the kernel also means that user application
programs need not be modified. Accordingly, the intelligent energy manager 120
co-exists with the system calls module 112, the scheduler 114 and the conventional
power manager 116 of the kernel, although it may require certain hooks within these
modules. The intelligent energy manager 120 is used to derive task deadlines and
task classification information (e.g. whether the task is associated with an interactive
application) from the OS kernel by examining the communication patterns between
the tasks. It also serves to monitor which system calls are accessed by each
task and how data flows between the communication structures in the kernel.

Figure 2 schematically illustrates three hierarchical layers of the performance
setting algorithm according to the present technique. It should be noted that on a given
processor the frequency/voltage setting options are typically discrete rather than
continuous. Accordingly the target processor performance level must be chosen from
a fixed set of predetermined values. Whereas known techniques of calculating a
target processor performance level involve use of a single performance-setting
algorithm, the present technique utilises multiple algorithms each of which has
different characteristics appropriate to different run-time situations. The most
applicable algorithm to a given processing situation is selected at run-time. The
policy co-ordinator module 122 co-ordinates the performance setting algorithms and,
by connecting to hooks in the standard kernel 100, provides shared functionality to the
multiple performance setting algorithms. The results of the multiple performance-
setting algorithms are collated and analysed to determine a global estimate for a target
processor performance level. The various algorithms are organised into a decision
hierarchy (or algorithm stack) in which the performance level indicators output by
algorithms at upper (more dominant) levels of the hierarchy have the right to override
the performance level indicators output by algorithms at lower (less dominant) levels
of the hierarchy. The example embodiment of Figure 2 has three hierarchical levels.
At the uppermost level of the hierarchy there is an interactive application performance
indicator 210, at the middle level there is an application-specific performance
indicator 220 and at the lowermost level of the hierarchy there is a task-based
processor utilisation performance indicator 230.
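
Because the available settings are discrete, the global estimate has to be mapped onto one of the fixed values. A minimal sketch of one way to do this is given below. The rounding-up policy, the `select_level` name and the list of available levels are all assumptions made for illustration; the text only states that the level is chosen from a fixed set of predetermined values.

```python
from typing import Sequence


def select_level(global_request: float, levels: Sequence[float]) -> float:
    """Pick the lowest available level that satisfies the request.

    Rounding up is an assumption made here so that the chosen level never
    falls short of the request; the highest level is used as a fallback.
    """
    ordered = sorted(levels)
    for level in ordered:
        if level >= global_request:
            return level
    return ordered[-1]


AVAILABLE = [0.25, 0.50, 0.75, 1.00]  # hypothetical fractions of peak speed
chosen = select_level(0.60, AVAILABLE)
print(chosen)
```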

The interactive application performance indicator 210 calculation is
performed by an algorithm based on that described in Flautner et al. "Automatic
Performance-setting for Dynamic Voltage Scaling", Proceedings of the International
Conference on Mobile Computing and Networking, July 2001. The interactive
application performance-level prediction algorithm seeks to guarantee good
interactive performance by finding the periods of execution that directly impact the
user experience and ensuring that these episodes complete without undue delay. The
algorithm uses a relatively simple technique for automatically isolating interactive
episodes. This technique relies on monitoring communication from the X server,
which is the GUI controller, and tracking the execution of the tasks that get triggered
as a result.

The beginning of an interactive episode (which typically comprises a
multiplicity of tasks) is initiated by the user and is signified by a GUI event, such as
pressing a mouse button or a key on the keyboard. As a result of this event, the GUI
controller (X server in this case) dispatches a message to the task that is responsible
for handling the event. By monitoring the appropriate system calls (various versions
of read, write, and select), the intelligent energy manager 120 can automatically detect
the beginning of an interactive episode. When the episode starts, both the GUI
controller and the task that is the receiver of the message are marked as being in an
interactive episode. If tasks of an interactive episode communicate with unmarked
tasks then the as yet unmarked tasks are also marked. During this process, the
intelligent energy manager 120 keeps track of how many of the marked tasks have
been pre-empted. The end of the episode is reached when the number of pre-empted
tasks is zero, indicating that all tasks have run to completion.
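
The marking scheme just described can be sketched in a few lines. This is an illustrative model only: the `EpisodeTracker` class, its method names and the task names are hypothetical, and testing for the end of the episode at the moment a marked task voluntarily gives up the processor is an assumption about when the pre-empted count is examined.

```python
from typing import Set


class EpisodeTracker:
    """Tracks which tasks belong to the current interactive episode."""

    def __init__(self) -> None:
        self.marked: Set[str] = set()
        self.preempted: Set[str] = set()
        self.active = False

    def gui_event(self, gui_task: str, receiver: str) -> None:
        # A GUI message starts an episode and marks sender and receiver.
        self.active = True
        self.marked.update({gui_task, receiver})

    def message(self, sender: str, receiver: str) -> None:
        # Marking spreads when a marked task talks to an unmarked one.
        if self.active and (sender in self.marked or receiver in self.marked):
            self.marked.update({sender, receiver})

    def task_preempted(self, task: str) -> None:
        if task in self.marked:
            self.preempted.add(task)

    def task_resumed(self, task: str) -> None:
        self.preempted.discard(task)

    def task_yielded(self, task: str) -> bool:
        """Return True if this voluntary yield ends the episode."""
        self.preempted.discard(task)
        if self.active and task in self.marked and not self.preempted:
            self.active = False
            self.marked.clear()
            return True
        return False


tracker = EpisodeTracker()
tracker.gui_event("xserver", "editor")
tracker.message("editor", "spellcheck")
tracker.task_preempted("editor")
ended_early = tracker.task_yielded("spellcheck")  # editor is still pre-empted
tracker.task_resumed("editor")
ended = tracker.task_yielded("editor")
print(ended_early, ended)
```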

Figure 3 illustrates the strategy for setting the processor performance level
during an interactive episode. The duration of an interactive episode is known to vary
by several orders of magnitude (from around 10^[...] up to around 1 second). However
a transition-start latency or 'skip threshold' of 5 milliseconds is set to filter out the
shortest interactive episodes thereby reducing the number of requested performance-
level transitions. The sub-millisecond interactive episodes are typically the results of
echoing key presses to the window or moving the mouse across the screen and
redrawing small rectangles. The skip threshold is set at 5 milliseconds since this
allows short episodes to be filtered out of the performance indicator predictions
without adversely impacting the worst case.

If the interactive episode duration exceeds the skip threshold, then the
associated performance-level value is included in the overall interactive performance-
level prediction. The performance factor for the next interactive episode is given by a
weighted exponentially decaying average of calculated performance factors for all
past interactive episodes. Note that according to the present technique the interactive
application performance setting algorithm uses a single global prediction for the
necessary performance level for an interactive episode in the system. (This differs
from the technique described in the above-mentioned publication, according to which
a per-task performance-level prediction was used depending on which task initiated
the episode.)

To bound the worst-case impact of an erroneous performance-level prediction
on the user experience, if the interactive episode does not finish before reaching a so-
called 'panic threshold', then the performance-level prediction of the top hierarchical
layer is specified to be the maximum performance level. Since this is a top-level
prediction it will be enforced by the system. At the end of the offending interactive
episode, the interactive algorithm computes what the correct performance-setting for
the episode should have been and this corrected value is incorporated into the
exponentially decaying average so that it will influence future predictions. An
additional optimisation is performed such that if the panic threshold was in fact
reached during an interactive episode, the moving average is re-scaled so that the
corrected performance level is incorporated in the exponentially decaying average
with a higher weight (k=1 is used instead of k=3). The performance prediction is
computed for all episodes that are longer than the skip threshold.
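
The interplay of the skip threshold, the panic threshold and the weighted average can be sketched as follows. Several details are assumptions made for illustration: the panic threshold value is hypothetical (the text does not give one), the average is assumed to take the form (k x old + new) / (k + 1), and the class and method names are invented.

```python
from typing import Optional

SKIP_THRESHOLD_S = 0.005   # 5 ms, as stated in the text
PANIC_THRESHOLD_S = 0.050  # hypothetical value; the text does not give one
MAX_LEVEL = 1.0            # maximum performance level, as a fraction of peak


def decayed_average(old: float, sample: float, k: int) -> float:
    """Assumed form of the weighted exponentially decaying average."""
    return (k * old + sample) / (k + 1)


class InteractivePredictor:
    """Single global prediction shared by all interactive episodes."""

    def __init__(self, initial: float = MAX_LEVEL) -> None:
        self.prediction = initial

    def level_during_episode(self, elapsed_s: float) -> Optional[float]:
        if elapsed_s <= SKIP_THRESHOLD_S:
            return None          # too short to justify a transition
        if elapsed_s >= PANIC_THRESHOLD_S:
            return MAX_LEVEL     # bound the worst case
        return self.prediction

    def episode_finished(self, duration_s: float, correct_level: float) -> None:
        if duration_s <= SKIP_THRESHOLD_S:
            return               # filtered out of the prediction
        panicked = duration_s >= PANIC_THRESHOLD_S
        k = 1 if panicked else 3  # corrected value weighs more after a panic
        self.prediction = decayed_average(self.prediction, correct_level, k)


predictor = InteractivePredictor(initial=0.5)
predictor.episode_finished(0.020, 0.7)   # normal episode, k=3
after_normal = predictor.prediction
predictor.episode_finished(0.080, 1.0)   # panic threshold reached, k=1
after_panic = predictor.prediction
print(after_normal, after_panic)
```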

Interactive episode 'deadlines' are used to obtain a performance-level
indicator for each identified interactive episode. The deadline is the latest time by
which a task must be completed to avoid adversely impacting performance. The
performance level indicator for interactive episodes is calculated in dependence upon
the human perception threshold associated with the particular interactive event. For
example it is known that a rate of 20 to 30 frames per second is fast enough for the
user to perceive a series of images as a continuous stream so that the perception
threshold could be set to 50ms for an interactive image display episode. Although the
exact value of the perception threshold is dependent on the user and the type of task
being accomplished, a fixed value of 50ms was found to be adequate for the
interactive algorithm of the hierarchy. The equation below is used for computing the
performance requirements of episodes that are shorter than the perception threshold.

$$\mathrm{Perf} = \frac{\mathrm{Work}_{fse}}{\mathrm{Perception\ Threshold}}$$

where the full-speed equivalent work Work_fse is measured from the beginning
of the interactive episode.
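
As a worked instance of this relation, the sketch below divides the full-speed equivalent work of an episode by the 50ms perception threshold. The function name is hypothetical, and capping the result at the maximum level is an assumption added for episodes whose work exceeds the threshold.

```python
PERCEPTION_THRESHOLD_S = 0.050  # 50 ms, the fixed value quoted in the text


def interactive_perf(work_fse_s: float) -> float:
    """Performance level needed to finish the work by the threshold.

    work_fse_s is the full-speed equivalent work in seconds, measured from
    the beginning of the interactive episode. The cap at 1.0 is an assumption.
    """
    return min(1.0, work_fse_s / PERCEPTION_THRESHOLD_S)


# An episode needing 20 ms of full-speed work can run at about 40% of peak.
perf = interactive_perf(0.020)
print(round(perf, 3))
```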

The application-specific performance indicator 220 of the middle hierarchical
layer is obtained by collating information output by a category of application
programs that are aware of performance level setting functionality. These program
applications have been adapted to submit (via system calls) specific information to the
intelligent energy manager 120 about their specific performance requirements. The
operating system and application programs can be provided with new API elements to
facilitate this communication regarding performance requirements.

The perspectives-based performance indicator 230 is obtained by
implementing a perspectives-based algorithm that estimates future utilisation of the
processor based on the recent utilisation history. This algorithm derives a utilisation
estimate for each individual task and adjusts the size of the time period over which
the utilisation-history is calculated (i.e. the utilisation-history window) on a task by
task basis. The perspectives-based algorithm takes account of all categories of task
being performed by the processor whereas the interactive application algorithm of the
uppermost layer takes account of interactive tasks. Since the interactive application
algorithm calculates a performance-level indicator that aims to guarantee a high
quality of interactive performance and it is situated at the uppermost level of the
hierarchy, the perspectives-based algorithm need not be constrained to a
conservatively short utilisation-history window. The possibility of using longer
utilisation-history windows at this lowermost hierarchical level allows for improved
efficiency since a more aggressive power reduction strategy can be selected when
appropriate. If the utilisation-history window is too short, this can cause the
performance-level predictions to oscillate rapidly between two fixed values. It is
typically necessary to set a short utilisation history window where a single unified
algorithm (rather than a hierarchical set of algorithms) is used to set the performance
level for all run-time circumstances. To be able to cope with intermittent processor-
intensive interactive events, such unified algorithms must keep the utilisation-history
window short.

Each of the performance-setting algorithms of the 3-layer stack uses a measure
of processing work-done in a given time interval. In this embodiment, the work-done
measure that is used is the full-speed (of the processor) equivalent work Work_fse
performed in that time interval. This full-speed equivalent work estimate is calculated
according to the following formula:

$$\mathrm{Work}_{fse} = \sum_{i=1}^{n} t_i \, p_i$$

where i is one of n different processor performance levels implemented during
the given time interval; t_i is the non-idle time in seconds spent at performance level i;
and p_i is the processor performance level i expressed as a fraction of the peak (full-
speed) processor performance level. This equation is valid on a system in which a
time-stamp counter (work counter) measures real time. The work-done would be
calculated differently in alternative embodiments that use cycle counters whose count
rate varies according to the current processor frequency. Furthermore, the above
equation makes the implicit assumption that the run-time of a workload is inversely
proportional to the processor frequency. This assumption provides a reasonable
estimate of work-done. However, primarily due to the non-linearity of bus speed to
processor speed ratios during performance scaling, the assumption is not always
accurate. In alternative embodiments the work-done calculation can be fine-tuned to
take account of such factors.
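
The full-speed equivalent work sum can be evaluated directly, as in the short sketch below. The function name and the sample figures are illustrative assumptions; each pair holds the non-idle time in seconds spent at a level and that level as a fraction of peak performance.

```python
from typing import Iterable, Tuple


def full_speed_equivalent_work(intervals: Iterable[Tuple[float, float]]) -> float:
    """Sum t_i * p_i over (non-idle seconds, performance fraction) pairs."""
    return sum(t * p for t, p in intervals)


# Hypothetical interval: 10 ms at full speed, 20 ms at half speed and
# 40 ms at quarter speed each contribute 10 ms of full-speed work.
samples = [(0.010, 1.0), (0.020, 0.5), (0.040, 0.25)]
work_fse = full_speed_equivalent_work(samples)
print(round(work_fse, 6))
```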

Figure 4 schematically illustrates execution of a workload on the processor
and calculation of the utilisation-history window for a task A. The horizontal axis of
Figure 4 represents time. Task A first starts execution at time S, whereupon a number
of per-task data structures are initialised. There are four of these data structures
corresponding to the following four pieces of information: (i) the current state of the
work counter; (ii) the current (real) time; (iii) the current state of an idle-time counter;
and (iv) a run bit is set to logical level '1' indicating that the task has started running.
The work-counter, the real-time counter and the idle-time counters are used to
calculate the processor utilisation associated with task A and subsequently to calculate
task A's performance requirements. At time PE, task A has not yet run to completion
but is pre-empted by another task, task B. Pre-empting will occur when the task
scheduler 114 determines that another task has higher priority than the task that is
currently running. When task A is pre-empted the run bit is maintained at logical level
'1' to indicate that the task still has work to complete. At time RE, task A resumes
execution, having been rescheduled, and continues to execute until it has run to
completion at time TC whereupon it voluntarily gives up processing time. On
completion task A may initiate a system call that yields the processor to another task.
On completion of task A at time TC the run-bit is reset to logical level '0'.

After time TC, there is an idle period followed by execution of a further task C
and a subsequent idle period. At time RS, task A begins execution for a second time.
At time RS, the '0' state of the run-bit associated with task A indicates that
information exists to enable calculation of task A's performance requirements so that
the processor target performance level can be accordingly set for the imminent re-
execution of task A. The utilisation-history window for a given task is defined to be
the period of time from the start of the first execution of the given task to the start of
the subsequent execution of the given task and should include at least one pre-empting
event (task A is pre-empted by task B at point PE in this case) of the given task within
the relevant window. Accordingly, in this case the utilisation-history window for task
A is defined to be the time period from time S to time RS. The target performance
level for task A in this window is calculated as follows:

$$\mathrm{WorkEst}_{New} = \frac{k \times \mathrm{WorkEst}_{Old} + \mathrm{Work}_{fse}}{k + 1}$$

$$\mathrm{Deadline}_{New} = \frac{k \times \mathrm{Deadline}_{Old} + (\mathrm{Work}_{fse} + \mathrm{Idle})}{k + 1}$$

where k is a weighting factor, Idle is the idle time in seconds in the time interval from
time S to time RS in Figure 4 and the deadline for task A is defined to be (Work_fse +
Idle). In this particular example, performing detection of pre-empting tasks such as
task B in Figure 4 guides the algorithm in determining the utilisation-history window
for each task. Processing tasks that are run before the next non-preempted scheduling
of task A are often highly correlated with the execution of task A. The idle time
between time points TC and RS is the "slack" that can be taken up by running the
processor at a reduced performance level. However, task C is factored into the
performance level calculation since it diminishes the available slack.

The above equations for WorkEst_New and Deadline_New each represent an
exponentially decaying average. Such exponentially decaying averages allow for
more recent estimates to have more influence on the average than less recent
estimates. The weighting factor k is a parameter associated with the exponentially
decaying average. It was found that a value of k=3 worked effectively and this small
value indicates that each estimate is a good estimate. By keeping track of the work
predictor and the deadline predictor separately, the performance predictions are
weighted according to the length of the utilisation-history window. This ensures that
the performance estimates associated with the larger window sizes do not dominate
the performance prediction. The performance level indicator Perf_perspectives-based for
this algorithm is given by the ratio of the two exponentially decaying averages:

$$\mathrm{Perf}_{perspectives\text{-}based} = \frac{\mathrm{WorkEst}_{New}}{\mathrm{Deadline}_{New}}$$

A separate performance level value is calculated for each task. According to the
strategy of the present technique, the work estimate WorkEst for a given task is
re-calculated on a workload dependent time-interval of between 50 and 150
milliseconds. However, since [...] is calculated on a task by task basis so that each
executed task draws on a respective appropriate [...] value, the [...] is actually
refined every 3 to 10 milliseconds (reflecting task switching events). This algorithm
differs from known interval-based algorithms in that it derives a utilisation estimate
separately for each task and also adjusts the size of the utilisation-history window on a
task by task basis. Although known unified performance-setting algorithms use
exponentially decaying averages, they calculate a global average over all performance
tasks for a fixed utilisation-history window (10 to 50 milliseconds) rather than a
task-based average over a variable task-based utilisation-history window.
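
The per-task bookkeeping described for the perspectives-based algorithm can be sketched as below. The `TaskPerspective` class and its method are hypothetical names, and two details are assumptions made for illustration: both averages start from zero, and full speed is requested while no deadline estimate exists yet.

```python
from dataclasses import dataclass

K = 3  # weighting factor reported in the text to work effectively


@dataclass
class TaskPerspective:
    """Work and deadline predictors kept separately for one task."""
    work_est: float = 0.0  # assumed starting value
    deadline: float = 0.0  # assumed starting value

    def update(self, work_fse: float, idle: float, k: int = K) -> float:
        """Fold in one utilisation-history window and return the new level.

        work_fse is the full-speed equivalent work and idle is the idle
        time, both in seconds, between the start of one execution of the
        task and the start of its next execution.
        """
        self.work_est = (k * self.work_est + work_fse) / (k + 1)
        self.deadline = (k * self.deadline + (work_fse + idle)) / (k + 1)
        if self.deadline == 0.0:
            return 1.0  # assumption: no history yet, so request full speed
        return self.work_est / self.deadline


task_a = TaskPerspective()
first = task_a.update(work_fse=0.030, idle=0.030)   # half the window was slack
second = task_a.update(work_fse=0.010, idle=0.030)  # a lighter window follows
print(round(first, 3), round(second, 3))
```

Keeping one such object per task is what gives each task its own utilisation-history window; the ratio is taken only after both averages have been updated, so long windows are not over-weighted.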

According to the perspectives-based algorithm of the present technique it is
necessary to avoid a situation occurring whereby a new non-interactive CPU-bound
task utilises the processor for an extensive period without being pre-empted. This
could introduce substantial latency in adaptation of the performance level to the task
since the utilisation-history window can only be defined once the task has been pre-
empted at least once. To avoid unwanted performance adaptation latency an upper
threshold is set for the non pre-empted duration over which the work estimate is
calculated. In particular, if a task continues without being pre-empted for 100
milliseconds, then its work estimate is recalculated by default. The value of 100
milliseconds was selected by taking into account that a more stringent application
history window is ensured for interactive applications via the dominant hierarchical
layer, which produces a separate interactive application performance indicator. It
was also considered that the only class of user applications likely to be affected by the
100 millisecond window threshold are computationally intensive batch jobs such as
compilation, which are likely to run for several seconds or even minutes. In such
cases an extra 100 milliseconds (0.1 seconds) of run time is unlikely to be significant
performance-wise.
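
A minimal sketch of the threshold rule, assuming times are held in seconds and that the caller knows whether the task has just been pre-empted (the function and parameter names are hypothetical):

```python
MAX_UNPREEMPTED_S = 0.100  # the 100 millisecond upper threshold from the text


def should_recalculate(now: float, window_start: float, preempted: bool) -> bool:
    """Return True when a task's work estimate should be recomputed.

    The estimate is normally refreshed when the task is pre-empted; the
    threshold forces a refresh for long-running, never pre-empted tasks.
    """
    return preempted or (now - window_start) >= MAX_UNPREEMPTED_S
```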

Figure 5 schematically illustrates an implementation of the three-layer
hierarchical performance policy stack of Figure 2. The implementation comprises a
performance indicator policy stack 510 and a policy event handler 530, each of which
outputs information to a target performance calculator 540. The target performance
calculator 540 serves to collate the results from four performance-setting algorithms,
including a [...] level application-based algorithm and two [...], which can run
concurrently. The target performance calculator 540 derives a single global target
performance level from the multiple performance-level indicators (in this case four)
produced by the policy stack 510. The policy stack 510 provides a framework
for multiple performance-setting policies, which can be replaced or [...] by the user.
Accordingly the performance policy-stack provides a platform in which user-
customised performance-setting policies can be incorporated.

Each of the multiple performance-setting algorithms is specialised to cope
with a different category of run-time event. However, since there are four
performance-indicators, the software must decide which of the four
performance-indicators should take precedence and the time at which a global target
performance-level can be validly calculated, given that each performance-setting
algorithm can run independently and at different times. It must also be
considered how to co-ordinate the calculation in the event that the
multiple performance-setting algorithms base their decisions on the same
processing event, otherwise spurious target updates may occur.

To address these issues the policy stack 510 algorithms are organised in a
three-level hierarchy as shown, where policies at higher hierarchical levels are
entitled to override the performance level requests derived from lower levels. Each
hierarchical level may itself comprise multiple alternative performance-setting
algorithms. The different performance-setting algorithms are not aware of their
positions in the hierarchy and can base their performance decisions on any event in
the system. When a given algorithm requests a performance level it submits a
command along with its performance-level indicator to the policy stack 510. For each
algorithm of the policy stack 510 a data structure comprising a command 512, 516,
520, 524 and a corresponding performance level
indicator 514, 518, 522 and 526 is stored. The command IGNORE 520 which applies
to the level 1 algorithm indicates to the target performance calculator 540 that the
associated performance-level indicator should be disregarded in calculation of the
global performance target. The command SET 512, 516 that has been specified for
both of the level 0 algorithms causes the target performance calculator 540 to set the
corresponding performance level without regard to any performance-level request
coming from lower in the hierarchy. However the SET command cannot override
performance level requests from higher hierarchical levels. In this embodiment one
level 0 algorithm has requested that the performance be set to 55% of peak level
whereas another level 0 algorithm has requested that the performance be set to 25% of
peak level. The target performance calculator uses an operator to combine these two
equal-priority requests, in this case preferentially selecting the 55% value as the level
0 performance-indicator. At level 2, the command 'SET IF GREATER THAN' has
been specified together with a performance indicator of 80%. The 'SET IF
GREATER THAN' command provides that the target performance calculator 540
should set the global target performance-level to be 80% provided that this is greater
than any of the performance indicators from lower hierarchical levels. In this case the
level 0 performance indicator is 55% and the level 1 performance indicator is to be
disregarded so that the global target will indeed be set to 80% of peak performance.

Since the most recently calculated performance level indicators for each 
algorithm are stored in memory by the policy stack 510, the target performance 
calculator 540 can calculate a new global target value at any time without having to 
invoke each and every performance-setting algorithm. When a new performance level 
request is calculated by one of the algorithms on the stack, the target performance 
calculator will evaluate the contents of the command-performance data structures 
from the bottom level up to compute an updated global target performance level. 
Accordingly in the example of Figure 5, at level 0 the global prediction is set to 55%, 
at level 1 it remains at 55% and at level 2 the global prediction changes to 80%. 
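
The bottom-up evaluation of the Figure 5 example can be sketched as below. The three command names come from the text; everything else is an assumption made for illustration: the `max` operator for combining equal-priority requests (chosen because it reproduces the 55% over 25% selection), the placeholder value stored with the IGNORE command, and the fall-back to peak performance when no request is active.

```python
from enum import Enum


class Command(Enum):
    IGNORE = "IGNORE"
    SET = "SET"
    SET_IF_GREATER_THAN = "SET IF GREATER THAN"


def combine_level(requests):
    """Combine equal-priority requests on one level into (command, value) or None."""
    active = [(cmd, value) for cmd, value in requests if cmd is not Command.IGNORE]
    if not active:
        return None
    # Assumption: the combining operator keeps the largest request.
    return max(active, key=lambda request: request[1])


def global_target(stack, default=100):
    """Evaluate the stack from level 0 (lowest) upwards; values are % of peak."""
    target = None
    for level in stack:
        combined = combine_level(level)
        if combined is None:
            continue  # the whole level is ignored, so the target is unchanged
        command, value = combined
        if command is Command.SET:
            target = value  # overrides anything requested lower down
        elif target is None or value > target:
            target = value  # SET IF GREATER THAN
    return default if target is None else target


# Figure 5 example; the value paired with IGNORE is a placeholder.
figure5 = [
    [(Command.SET, 55), (Command.SET, 25)],   # level 0
    [(Command.IGNORE, 0)],                    # level 1
    [(Command.SET_IF_GREATER_THAN, 80)],      # level 2
]
```

Evaluating `figure5` gives 55 after level 0, leaves it at 55 after level 1, and raises it to 80 at level 2, matching the walk-through in the text.
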

Although each of the performance-setting algorithms can be triggered (by a
processing event in the system) to calculate a new performance level at any time there
is a set of common events to which all of the performance-setting algorithms will
tend to respond. These events are monitored and flagged by the policy event handler
530, which provides policy event information to the target performance calculator
540. This special category of events comprises reset events 532, task switch events
534 and performance change events 536. The performance change event 536 is a
notification that alerts each performance-setting algorithm to the current performance
level of the processor although it does not usually alter the performance requests on
the policy stack 510. For this special category of policy events 532, 534, 536, the
global target level is not recomputed each time one of the algorithms issues an
updated performance-level indicator. Rather, the target performance level calculation
is co-ordinated so that the calculation is performed once only for each event
notification after all event handlers of all interested performance-setting algorithms
have been invoked.
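
One way to picture this co-ordination is a dispatcher that defers the global calculation until every subscribed handler has run. The class below is a hypothetical sketch: its names are invented, and the `max` in `recompute` merely stands in for the real command-based combination performed by the target performance calculator.

```python
class PolicyStack:
    """Toy policy stack that recomputes the target once per policy event."""

    def __init__(self):
        self.handlers = {}        # event name -> list of handler callables
        self.requests = {}        # algorithm name -> latest requested level
        self.recomputations = 0   # how many global calculations were made
        self._dispatching = False

    def subscribe(self, event, handler):
        self.handlers.setdefault(event, []).append(handler)

    def request(self, name, level):
        self.requests[name] = level
        # Outside a policy event each new request triggers a recomputation.
        if not self._dispatching:
            self.recompute()

    def recompute(self):
        self.recomputations += 1
        return max(self.requests.values(), default=None)  # stand-in operator

    def dispatch(self, event):
        """Invoke every interested handler, then recompute exactly once."""
        self._dispatching = True
        try:
            for handler in self.handlers.get(event, []):
                handler(self)
        finally:
            self._dispatching = False
        return self.recompute()
```

With two algorithms subscribed to a task switch event, a single `dispatch` call leads to one global calculation rather than two.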

Device drivers or devices themselves may be provided with an application 
program interface (API) that enables an individual device to inform the policy stack 
510 and/or individual performance setting algorithms of the policy stack of any 
significant change in operating conditions. This allows the performance-setting 
algorithms to trigger recalculation of target performance levels thereby promoting 
rapid adaptation to the change in operating conditions. For example, a notification 
could be sent by the device to the policy stack 510 when a processor-intensive CPU- 
bound task starts up. Such a notification is optional and the performance-setting 
algorithms may but need not respond to it on reception. 

Figure 6 schematically illustrates a work-tracking counter 600 according to the
present technique. The work-tracking counter 600 comprises: an increment value
register 610 having a software control module 620 and a hardware control module
630; an accumulator module 640 comprising a work-count value register and a
time-count value register; a time-base register 646; a real-time clock 650 and a control
register 660. The work-tracking counter of this example embodiment differs from
known timestamp counters and CPU cycle counters in that the counter increment
values are proportional to the actual work being performed by the processor at or
close to the time that the count value is incremented. The increment value register
610 comprises a work-done calculator that estimates the work done by the processor
in each counter cycle. The work-done estimates are obtained via the software control
module 620 and/or via the hardware control module 630. The software control
module 620 implements a simple work-done calculation that correlates the increment
value with the current processor speed. If the processor is running at 70% of peak
performance then the increment value will be 0.7 whereas if the processor is running
at 40% of peak performance the increment value will be 0.4. When the software
control module 620 detects that the processor is idle during a counter cycle then the
increment value is set to zero. In alternative embodiments of the work-tracking
counter a more sophisticated software algorithm is used to calculate a refined work-
done estimate.

Table 1 lists measurement data that gives a percentage discrepancy between an
expected run-time duration and an actual run-time duration for both a CPU bound
loop and for an MPEG video workload when considering a performance-level
transition between two different processor speeds (from a higher to a lower speed in
this case). The results are based on post-transition runs at three distinct processor
performance levels: 300, 400, and 500 MHz (as specified in the left-most column of
the table). The top row of Table 1 lists the initial performance level from which the
transition to the corresponding processor speed in the left-most column was made.
On the CPU bound loop, the difference between the predicted and actual
measurements is indistinguishable from the noise, whereas for the MPEG workload,
there is about a 6%-7% inaccuracy penalty per 100 MHz step in processor frequency.
The maximum inaccuracy on these workloads is seen to be less than 20%
(19.4%), which is considered to be acceptable for a system with only a few fixed
performance-levels. However as the available range of minimum to maximum
processor performance levels that are selectable in a system increases and the range of
each performance-level step decreases, it is likely that a more accurate work-
estimator than the processor speed will be required.



Table 1

                        CPU BOUND LOOP                  MPEG VIDEO WORKLOAD
Post-transition     (initial speed)                 (initial speed)
speed               400 MHz  500 MHz  600 MHz       400 MHz  500 MHz  600 MHz

300 MHz              -0.3%    -0.4%    -0.3%          7.1%    13.5%    19.4%
400 MHz                       -0.1%     0.0%                   6.9%    13.3%
500 MHz                                 0.1%                            6.8%



The more sophisticated algorithm of alternative example embodiments uses a
more accurate work-done estimation technique that involves monitoring the
instruction profile (via counters that keep track of significant events such as memory
accesses) and the expected and actual decrease rate of the workload, rather than
the assumption that the work-done is directly proportional to the processor
speed. Further alternative embodiments use cache hit-rates and memory-system
performance indicators to refine the work-done estimate. Yet further alternative
example embodiments use software to monitor the percentage of processing time spent
in executing a programming application (equated to useful work-done) relative to the
percentage of processing time used in performing background operating-system tasks.

The hardware control module 630 is capable of estimating work-done even
during transition periods when the processor is in the process of switching between
two performance levels. For each processor performance transition there may
be a pause of around 20 microseconds during which the processor does not issue any
instructions. This pause is due to the time needed to resynchronise the phase-locked
loops to the new target processor frequency. Furthermore, before the processor
frequency can be changed, the voltage must be stabilised to an appropriate value for
the new target frequency. Accordingly, there is a transition time of up to 1
millisecond, during which it can be assumed that the processor is running at the old
target frequency but energy is being consumed at the new target level (since the
voltage has been set to the new target level). The frequency may be ramped up in
several stages via intermediate frequency steps to effect the performance-level
change. During such transition periods when the frequency of the processor is
changing dynamically the hardware control module 630 is operable to update the
increment value register taking account of the dynamic changes of which the software
is unaware. Although this example embodiment makes use of both software and
hardware control modules 620, 630 to calculate the work done, alternative
embodiments may use only one of these two modules to estimate the work-done.

The accumulator module 640 periodically reads the increment value from
the increment value register 610 and adds the increment value to an accumulated sum
stored in the work-count value register. The work-count value register increments the
work-count value every clock-tick. The clock-tick is a time signal derived from the
real-time clock 650. To measure the work-done during a predetermined time interval,
the work-count value stored in the accumulator module 640 is read twice: once at
the beginning of the predetermined time interval and once at the end. The difference
between these two values provides an indication of the work-done during the
predetermined time interval.

The real-time clock 650 also controls the rate at which the time-count value
stored in the register 644 is incremented. This time-count value register works on the
same time base as the work-count value but is used to measure time elapsed rather
than work done. Having both a time counter and a work-done counter facilitates
performance-setting algorithms. The time-base register 646 is provided for the
purpose of multi-platform compatibility and conversion to seconds. It serves to
specify the time base (frequency) of the two counters 642, 644 so that time can be
accurately and consistently measured, i.e. the accumulated value stored in the
time-count value register provides an indication of the time elapsed in milliseconds.
The control register module 660 comprises two control registers, one for each
counter. A counter can be enabled, disabled or reset via the appropriate control
register.
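
The behaviour of the two counters can be modelled in software as below. This is a sketch of the simple speed-proportional estimate only; the class and method names are hypothetical, and the hardware control module's handling of transition periods is not modelled.

```python
class WorkTrackingCounter:
    """Software model of the work-count and time-count value registers."""

    def __init__(self, ticks_per_second: int):
        self.time_base = ticks_per_second  # role of the time-base register
        self.work_count = 0.0              # accumulated work-done estimate
        self.time_count = 0                # elapsed clock ticks
        self.increment = 0.0               # role of the increment value register

    def set_speed(self, fraction_of_peak: float, idle: bool = False) -> None:
        # Simple estimate: proportional to processor speed, zero when idle.
        self.increment = 0.0 if idle else fraction_of_peak

    def tick(self, n: int = 1) -> None:
        # Every clock tick adds the current increment to the accumulated sum.
        self.work_count += self.increment * n
        self.time_count += n

    def elapsed_seconds(self, start_ticks: int) -> float:
        # The time base converts a tick difference into seconds.
        return (self.time_count - start_ticks) / self.time_base
```

Reading `work_count` at the start and end of an interval and subtracting gives the work done: 100 ticks at 70%, 100 ticks at 40% and 50 idle ticks yield 110 work units over 250 ticks.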


Figure 7 schematically illustrates an apparatus that is capable of providing a
number of different fixed performance-levels in dependence upon workload
characteristics. The apparatus comprises a CPU 710, a real-time clock 720, a power
supply control module 730 and the increment value register 610 of the work-tracking
counter of Figure 6. The power supply control module 730 determines which of the
fixed performance-levels the CPU is currently set to run at and selects an appropriate
clock frequency for the real-time clock 720. The power supply control module 730
inputs information on the current processor frequency to the increment value register
610. Accordingly the value of the increment is proportional to the processor
frequency, which in turn provides an estimate of useful work-done by the processor.

Many of the performance-setting algorithms of the policy stack 510 use the
utilisation history of the processor over a given time interval (window) to estimate the
appropriate future target speed of the processor. The principal objective of any
performance-setting policy is to maximise the busy time of the processor in the period
from the start of execution through to the task deadline by reducing the processor
frequency and voltage levels to an appropriate target performance level.

To enable the target performance level to be realistically predicted, the
intelligent energy manager 120 provides an abstraction for tracking the actual work
done by the processor during a given time interval. This work-done abstraction
allows performance changes and idle time to be taken into account regardless of the
specific hardware counter implementations, which can vary between platforms.
According to the present technique, to obtain a work measurement estimate over a
time interval, each performance-setting algorithm is allocated a 'work structure' data
structure. Each algorithm is set up to call a 'work-start function' at the beginning of
the time interval and a 'work-stop function' at the end of the given time interval.
During the work-done measurement, the contents of the work structure are
automatically updated to specify the proportion of idle time and the proportion of
utilised processor time weighted by the respective performance levels of the
processor. The information stored in the work structure is then used to compute the
full-speed equivalent work value (Workfse), which is subsequently used for target
performance-level prediction. This work-done abstraction functionality, which is
implemented in software in the intelligent energy manager 120, provides performance-
level prediction algorithm developers with a convenient interface to the intelligent
energy manager 120. The work-done abstraction also simplifies porting of the
performance-setting system of the present technique to different hardware
architectures.
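
A minimal sketch of such an interface is given below. The function names echo the 'work-start' and 'work-stop' functions mentioned in the text, but their signatures, the `account` helper through which a platform layer reports elapsed time, and the field names are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkStructure:
    """Per-algorithm measurement record (field names are hypothetical)."""
    idle_time: float = 0.0      # seconds spent idle during the interval
    weighted_busy: float = 0.0  # busy seconds scaled by fraction of peak
    running: bool = False


def work_start(ws: WorkStructure) -> None:
    """Begin a measurement interval with cleared totals."""
    ws.idle_time = 0.0
    ws.weighted_busy = 0.0
    ws.running = True


def account(ws: WorkStructure, seconds: float, perf_fraction: float,
            idle: bool = False) -> None:
    """Record a slice of time; called by the platform layer on state changes."""
    if not ws.running:
        return
    if idle:
        ws.idle_time += seconds
    else:
        ws.weighted_busy += seconds * perf_fraction


def work_stop(ws: WorkStructure) -> float:
    """End the interval and return the full-speed equivalent work (Workfse)."""
    ws.running = False
    return ws.weighted_busy
```

For example, 10 ms at half speed followed by 5 ms at full speed and 5 ms idle gives a full-speed equivalent of 10 ms of work and 5 ms of idle time.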

One significant difference between alternative hardware platforms is the 
manner in which time is measured on the platform. In particular, some architectures 
provide a low overhead method of cycle-counting via timestamp counters whereas 
other architectures only provide the user with externally programmable timer 
interrupts. However even when timestamp counters are provided they do not 
necessarily measure the same things. For example a first category of hardware 
platforms includes both current Intel [RTM] Pentium and ARM [RTM] processors. In 
these processors the timestamp counters count CPU-cycles so that the count-rate 
varies in accordance with the speed of the processor and the counter stops counting 
when the processor enters into sleep mode. A second category of hardware platforms,
which includes the Crusoe [RTM] processor, has an implementation of the
timestamp counter that consistently counts the cycles at the peak rate of the processor
and continues to increment the count at the peak rate even when the processor is in
sleep mode. The work-done abstraction facilitates implementation of the present
target performance-setting technique on both of these two alternative categories of
hardware platform.

The work estimate Workfse as calculated in this embodiment does not take 
account of the fact that a given workload running at half of peak performance does not 
necessarily take twice as long to run to completion as it would at the full processor 
speed. One reason for this counter-intuitive result is that although the processor core 
is slowed down, the memory system is not. As a result, the core to memory 
performance ratio improves in the memory's favour. 



Simulations were performed to evaluate the present performance-setting
technique against a known technique. In particular, the known technique is the
'LongRun' power manager that is built into the firmware of the Transmeta Crusoe
processor. LongRun is different from other known power management
techniques in that it avoids the need to modify the operating system in order to effect
the power management. LongRun uses the historical utilisation of the processor to
guide clock rate selection: it speeds up the processor if utilisation is high and
decreases performance if utilisation is low. Unlike on more conventional processors,
the power management policy can be implemented on the Crusoe processor relatively
easily because the processor already has a hidden software layer that performs
dynamic binary translation and optimisations. The simulations aimed to establish
how effectively a policy such as LongRun that is implemented at such a low level in
the software hierarchy can perform. The present technique was run alongside
LongRun on the same processor.

The simulations were performed on a Sony Vaio [RTM] PCG-C1VN
notebook computer using the Transmeta Crusoe 5600 processor running at a number
of fixed performance levels ranging from 300 MHz to 600 MHz in 100 MHz
performance-level steps. The simulations used a Mandrake 7.2 operating system with
a modified version of the Linux 2.4.4-ac18 kernel. The workloads used in the
comparative evaluation were as follows: Plaympeg SDL MPEG player library;
Acrobat Reader for rendering PDF files; Emacs for text editing; Netscape Mail and
News 4.7 for news reading; Konqueror 1.9.8 for web browsing; and Xwelltris 1.0.0 as
a 3D game. The benchmark used for interactive shell commands was a record of a
user performing miscellaneous shell operations during a span of about 30 minutes. To
avoid possible variability due to the dynamic translation engine of the Crusoe
processor, most benchmarks were run at least twice to warm up the dynamic
translation cache; simulation data from all but the last run were disregarded.

The performance-setting algorithm according to the present technique has
been designed so that it is unobtrusive to its host platform in the way timers are
handled. For the purpose of the simulations the present technique provided a sub-
millisecond resolution timer, without changing the way in which the Linux built-in
10 ms resolution timer worked. This was accomplished by piggybacking a timer
dispatch routine (which checks for timer events) onto often executed parts of the
kernel, such as the scheduler and system calls.

Since the performance-setting algorithm according to the present technique is
designed such that it has hooks to the kernel that allow it to intercept certain system
calls to find interactive episodes, and it is invoked on every task switch, it was
straightforward to add a few instructions to these hooks to manage timer dispatches.
Each hook was augmented by implementing a read of the timestamp counter, a
comparison against the time stamp of the next timer event and a branch to the timer
dispatch routine upon success. In practice it was found that this strategy yielded a
timer with sub-millisecond accuracy.
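
The read, compare and branch sequence can be sketched in user-space Python as follows. The `SoftTimers` class and its heap-based queue are illustrative assumptions; the kernel implementation described above is not shown in the source, and the timestamp is passed in by the caller rather than read from a hardware counter.

```python
import heapq


class SoftTimers:
    """Timer queue checked opportunistically from frequently executed hooks."""

    def __init__(self):
        self._queue = []  # min-heap of (due_time, sequence, callback)
        self._seq = 0     # tie-breaker so callbacks are never compared

    def add(self, due_time, callback):
        heapq.heappush(self._queue, (due_time, self._seq, callback))
        self._seq += 1

    def check(self, timestamp):
        """Hook body: compare against the next due time, dispatch if reached.

        Returns the number of timer events dispatched.
        """
        # Fast path: a single comparison when nothing is due yet.
        if not self._queue or timestamp < self._queue[0][0]:
            return 0
        fired = 0
        while self._queue and self._queue[0][0] <= timestamp:
            _, _, callback = heapq.heappop(self._queue)
            callback(timestamp)
            fired += 1
        return fired
```

The achievable accuracy depends on how often `check` is reached, which is why the hooks are placed on paths such as the scheduler and system calls.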

Table 2 below details the timer statistics pertaining to the simulations. The
worst-case timer resolution was bounded by the 10 millisecond (seems to be
inconsistent with Table 2) time quantum of the scheduler. However, since the events
that the performance-setting algorithm according to the present technique is interested
in measuring usually occur close to the timer triggers, the achieved resolution was
considered to be adequate. It proved to be advantageous that the soft-timers of the
system stopped ticking when the processor was in sleep mode since this meant that
the timer interrupts did not change the sleep characteristics of the running operating
system and program applications. The timers used had high resolution but low
overhead. These advantageous features of the timers facilitated development of an
implementation having both an active mode and a passive mode. In the active mode
the performance-setting algorithm according to the present technique was in control.
In the passive mode the built-in LongRun power manager was in charge of
performance although the intelligent energy manager of the present technique acted as
an observer of the execution and performance changes.



Table 2

Cost of an access to a timestamp counter              30 to 40 cycles
Mean interval between timer checks                    ~0.1 milliseconds
Timer accuracy                                        ~1 millisecond
Average timer check and dispatch duration             100 to 150 cycles
(including possible execution of an event handler)

Monitoring the performance changes caused by LongRun was accomplished similarly
to the timer dispatch routine. The intelligent energy manager 120 according to the
present technique periodically read the performance level of the processor through a
machine-specific register and compared the result to a previous value. If the two
values were different, then the change was logged in a buffer. The intelligent energy
manager according to the present technique includes a tracing mechanism that retains
a log of significant events in a kernel buffer. This log includes performance-level
requests from the different policies, task pre-emptions, task IDs (identifiers), and the
performance levels of the processor. In performing the simulations it was possible to
compare LongRun and the performance-setting algorithm according to the present
technique during the same execution run: LongRun was in control of performance-
setting while the intelligent energy manager 120 of the present technique was
operable to output the decisions that it would have made on the same workload, had it
been in control. This simulation strategy made it possible to compare the known
LongRun technique and the present technique objectively on interactive benchmarks
whose runs are not repeatable.
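
The passive-mode monitoring amounts to a poll-and-compare step. In the sketch below, `read_level` is a hypothetical stand-in for the read of the machine-specific register, and a Python list stands in for the kernel trace buffer.

```python
def poll_performance(read_level, previous, log, timestamp):
    """Read the current performance level and log it only if it changed.

    Returns the level just read so the caller can pass it to the next poll.
    """
    current = read_level()
    if current != previous:
        log.append((timestamp, previous, current))
    return current
```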

In order to assess the overhead of using the measurement and performance-
setting techniques, the performance-setting algorithm according to the present
technique was instrumented with markers that kept track of the time spent in the
performance-setting algorithm code at run-time. Although the run-time overhead of
the present technique on a Pentium II was found to be around 0.1% to 0.5%, on the
Transmeta Crusoe processor the overhead was between 1% and 4%. Further
measurements in virtual machines such as 'VMWare' and 'user-mode-linux' (UML)
confirmed that the overhead of the performance-setting algorithms according to the
present technique can be significantly higher in virtual machines than on traditional
processor architectures. However this overhead could be effectively reduced by
algorithm optimisation.

MPEG (Moving Picture Experts Group) video playback posed a difficult
challenge for all of the tested performance-setting algorithms. Although MPEG
playback typically puts a periodic load on the system, the
performance requirements can vary depending on the MPEG frame-type. As a
consequence, if a performance-setting algorithm uses a comparatively long time-
window corresponding to past (highly variable) MPEG frame-decoding events to
predict future performance requirements, it can miss the execution deadlines for (less-
representative) more computationally intensive frames. On the other hand, if the
algorithm looks at only a short interval, then it will not converge to a single
performance value but oscillate rapidly between multiple settings. Since each change
in performance-level incurs a transition delay, rapid oscillation between different
performance-levels is undesirable. The simulation results for LongRun confirm this
oscillatory behaviour for the MPEG benchmark.

The present technique deals with this problem of oscillation for the MPEG
workload by relying on the interactive performance-setting algorithm at the top level
of the hierarchy to bound worst-case responsiveness. The more conventional interval-
based perspectives algorithm at the bottom level of the hierarchy is thus able to take a
longer-term view of performance-level requirements.

Figure 8 is a table that details simulation measurement results for the
'plaympeg' video player (http://www.lokigames.com/development/smpeg.php3)
playing a variety of MPEG videos. Some of the internal variables of the video player
have been exposed to provide information about how the player is affected as the
result of dynamically changing the processor performance-level during execution.
These figures are shown in the MPEG decode column of the table. In particular, the
'Ahead' variable measures how close to the deadline each frame decoding comes.
The closeness to the deadline is expressed as cumulative seconds during the playback
of each video. For maximum power efficiency, the Ahead variable value should be as
close to zero as possible, although the slowest performance level of the processor puts
a lower limit on how much the Ahead value can be reduced. An 'Exactly on time'
field in the right-most column of the table specifies the total number of frames that met
their deadlines exactly. The more frames that are exactly on time, the closer the
performance-setting algorithm is to the theoretical optimum. The data in the
Execution Statistics column of the table of Figure 8 was collected by the intelligent
energy manager 120 monitoring sub-system. To collect information about LongRun,
the intelligent energy manager 120 was used in passive mode to gather a trace of
performance changes without controlling the processor performance level. The Idle
field specifies the fraction of time spent in the idle loop of the kernel (possibly doing
housekeeping chores or just spinning) whereas the Sleep field specifies the fraction of
time that the processor actually spends in a low-power sleep mode. It can be seen
from the table in Figure 8 that for each of these performance measures the present
technique performs considerably better than LongRun.

Figure 9 is a table that lists processor performance level statistics collected
during the runs of each workload. The fraction of time at each performance level is
computed as a proportion of the total non-idle time during the run of the workload.
The 'Mean perf level' column of the table specifies the average performance levels
(as the percentage of peak performance) during the execution of each workload.
Since, in all cases, the mean performance level for each workload was lower using the
present technique than for LongRun, the last column specifies the mean performance
reduction achieved with regard to LongRun. The playback quality for both the
LongRun workload and the workload of the present technique was the same, i.e.
identical frame rates and no dropped frames.

The results show that the present technique is able to predict the necessary
performance level more accurately than the known LongRun technique. The increased
accuracy results in an 11% to 35% reduction of the average performance levels of the
processor during execution of the benchmarks. Since the amount of work between
runs of a workload should stay the same, the lower average performance level implied
that reduced idle and sleep times could be expected when the intelligent energy
manager of the present technique is enabled. This expectation was affirmed by the
simulation results. Similarly, the number of frames that exactly meet their deadlines
increases when the intelligent energy manager of the present technique is enabled and
the cumulative amount of time when decode is ahead of its deadline is reduced.
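
The way a mean performance level follows from the per-level time fractions can be illustrated as below. The 600 MHz peak matches the processor used in the simulations; treating the reported reduction as a relative reduction is an assumption of this sketch, and any inputs passed to these functions are illustrative rather than values from Figure 9.

```python
PEAK_MHZ = 600  # highest performance level of the simulated processor


def mean_performance_level(busy_seconds_by_mhz):
    """Mean level as a percentage of peak, weighted by non-idle time."""
    total = sum(busy_seconds_by_mhz.values())
    if total <= 0:
        raise ValueError("no non-idle time recorded")
    weighted = sum(mhz * secs for mhz, secs in busy_seconds_by_mhz.items())
    return 100.0 * weighted / (total * PEAK_MHZ)


def relative_reduction(baseline_pct, candidate_pct):
    """Percentage by which the candidate's mean level undercuts the baseline."""
    return 100.0 * (baseline_pct - candidate_pct) / baseline_pct
```

For instance, a run that spends 9 s of non-idle time at 300 MHz and 1 s at 600 MHz has a mean level of 55% of peak.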

The median performance level (highlighted with bold in each column of the 
table of Figure 9) also shows significant reductions. Whereas on most benchmarks 
the performance-setting algorithm according to the present technique settles on a 
single performance level below peak for the greatest fraction of execution time 
(>88%), LongRun usually sets the processor to run at full-speed. The exception to 
this general rule is the 'Danse De Cable' workload, where the performance-setting 
algorithm according to the present technique settles on the lowest two performance 
levels and oscillates between these two levels. This oscillatory behaviour is due 
to the specific performance levels available on the Crusoe processor. The 
performance-setting algorithm according to the present technique would have elected 
to select a performance level only slightly higher than 300 MHz, so that as the 
performance-level prediction fluctuated above and below the 300 MHz value, the 
target performance level was quantized to the closest two performance levels. The 
most notable difference in performance between the known LongRun technique and 
the present technique is that LongRun appears to be over-cautious in that it ramps up 
the performance level very quickly when it detects significant amounts of processor 
activity. 
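
The quantization effect described above can be illustrated with a minimal sketch. 
It assumes, hypothetically, that a requested speed is rounded up to the lowest 
supported level that satisfies it (the text does not state the rounding rule). The 
function name quantize_up and the sample predictions are invented for illustration; 
only the four level values come from the description. 

```python
# Supported Crusoe performance levels named in the text, in MHz.
SUPPORTED_MHZ = (300, 400, 500, 600)


def quantize_up(desired_mhz, levels=SUPPORTED_MHZ):
    """Map a desired speed onto a supported level.

    Assumption (not stated in the text): the request is rounded up to the
    lowest supported level that is at least as fast as the desired speed.
    Requests below the minimum or above the maximum are clamped.
    """
    for level in sorted(levels):
        if level >= desired_mhz:
            return level
    return max(levels)


# Hypothetical predictions hovering around 300 MHz.
predictions = [295, 305, 298, 310]
quantized = [quantize_up(p) for p in predictions]
print(quantized)  # [300, 400, 300, 400]
```

Under this assumed rule, a prediction that drifts only a few MHz either side of 300 
MHz is enough to make the selected level alternate between 300 MHz and 400 MHz. 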

Over all workloads, the average processor performance level with LongRun 
never fell below 80%, whilst the performance level set by the present technique fell to 
as low as 52% for the 'Red's Nightmare small' benchmark. The algorithm according 
to the present technique is more aggressive than LongRun but responds quickly when 
the quality of service appears to have been compromised. Since LongRun does not 
have any information about the interactive performance, it is forced to act 
conservatively on a shorter time frame and the simulation results show that this leads 
to inefficiencies. 

Figure 10 comprises two graphs of results for playback of two different 
MPEG movies entitled 'Legendary' (Figure 10A) and 'Danse de Cable' (Figure 
10B). Each graph illustrates the fraction of time spent at each of four processor 
performance levels (300, 400, 500, and 600 MHz) for both LongRun and the present 
technique. Although the playback quality for each run was identical, it can be seen 
from the graphs that use of the algorithm according to the present technique meant 
that the processor spent significantly longer at below peak performance than it did 
when the LongRun technique specified the performance level. The results for 
playback of the 'Legendary' movie plotted in Figure 10A show that the algorithm 
according to the present technique settles on a performance level of 500 MHz. The 
results for the 'Danse de Cable' movie shown in Figure 10B reveal that using the 
algorithm according to the present technique, the processor switched between two 
performance levels, i.e. 300 MHz and 400 MHz. By way of contrast, for both of these 
movies the LongRun performance-setting algorithm chose the peak processor speed 
of 600 MHz for a dominant portion of the execution time. 
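
As a rough illustration of how per-level time fractions such as those plotted in 
Figure 10 relate to a mean performance level expressed as a percentage of peak, the 
following sketch computes a time-weighted average. The function name and the example 
fractions are hypothetical, and the time-weighted definition is an assumption rather 
than the documented method used to produce the table of Figure 9. 

```python
PEAK_MHZ = 600  # peak level named in the text


def mean_performance_percent(time_fractions, peak_mhz=PEAK_MHZ):
    """Time-weighted mean performance level as a percentage of peak.

    time_fractions maps a level in MHz to the fraction of run time spent
    at that level; the fractions are expected to sum to 1.
    """
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("time fractions must sum to 1")
    weighted_mhz = sum(mhz * frac for mhz, frac in time_fractions.items())
    return 100.0 * weighted_mhz / peak_mhz


# Hypothetical split between the two lowest levels (not measured data).
example = {300: 0.5, 400: 0.5}
print(round(mean_performance_percent(example), 1))  # 58.3
```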

Figure 11 provides qualitative insight into the characteristics of the two 
different performance-setting policies. LongRun keeps switching the performance 
level up and down in fast succession, while the processor performance level of the 
system when controlled according to the present technique stays close to a target 
performance level. The two graphs of Figure 11A (top row) show the performance 
levels of the processor during a benchmark run with LongRun enabled. Figures 11B 
and 11C (middle and bottom rows) show performance-level results for the same 
benchmark but with the algorithm of the present technique enabled. Figure 11B 
shows the actual performance levels during execution, while Figure 11C reflects the 
performance level that the performance-setting algorithm according to the present 
technique would request on a processor that could run at arbitrary performance levels 
(given the same maximum performance). Note that in some cases, the desired 
performance levels calculated by the algorithm according to the present technique are 
actually below the minimum achievable performance level on the processor. 

Now consider simulation results for comparison of the two techniques on 
interactive workloads. Due to the difficulty in making interactive benchmark runs 
repeatable, interactive workloads are significantly harder to evaluate than the 
multimedia benchmarks. To circumvent this problem, empirical measurements were 
combined with a simple simulation technique. More specifically, the interactive 
benchmarks were run under the control of the native LongRun power manager and the 
intelligent energy manager 120 according to the present technique was only engaged 
in passive mode, so that it merely recorded the performance-setting decisions that it 
would have made but did not actually change the performance levels of the processor. 
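
The passive mode described above can be sketched as a thin wrapper that always logs 
what a policy would have chosen but only applies the choice when it is in active 
control. This is an illustrative sketch only: the class, its method names and the 
toy policy are hypothetical and are not the intelligent energy manager 120 itself. 

```python
class PassiveRecorder:
    """Wraps a performance-setting policy and logs its decisions.

    In passive mode the decisions are only recorded; in active mode they
    are also passed to `apply_level`. All names here are hypothetical.
    """

    def __init__(self, policy, apply_level, passive=True):
        self.policy = policy            # callable: utilisation -> level
        self.apply_level = apply_level  # callable that would set the speed
        self.passive = passive
        self.trace = []                 # list of (timestamp, level)

    def on_sample(self, timestamp, utilisation):
        level = self.policy(utilisation)
        self.trace.append((timestamp, level))
        if not self.passive:
            self.apply_level(level)
        return level


applied = []
recorder = PassiveRecorder(
    policy=lambda u: 600 if u > 0.8 else 300,  # toy policy, illustration only
    apply_level=applied.append,
    passive=True,
)
recorder.on_sample(0.0, 0.9)
recorder.on_sample(0.1, 0.2)
print(recorder.trace)  # [(0.0, 600), (0.1, 300)]
print(applied)         # []
```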

Figure 12 shows the performance data that was collected during a simulation 
run for assessment of interactive workloads. Figure 12A is a graph of percentage 
performance level against time (in seconds) for the LongRun technique and in this 
case the plotted results correspond to the actual performance levels of the processor 
during the measurement. Figure 12B is a plot of the quantized performance levels 
whereas Figure 12C is a plot of the raw performance levels as a function of time that 
the performance-setting algorithm of the present technique would have set, had it been 
in control of the processor. Note that if the algorithm of the present technique had in 
fact been in control, its performance-setting decisions would have had a different run- 
time impact from those made by LongRun. For this reason the time axes on the 
graphs of Figures 12B and 12C should be regarded as approximations. 

To get around the time-skew problem in the statistics, the passive 
performance-level traces of the simulations according to the present technique were 
post-processed to assess the impact of the increased execution times that would have 
resulted from the use of the present technique instead of LongRun. Rather than 
looking at the entire performance-level trace, only the interactive episodes were 
considered. The interactive performance-setting algorithm of the present technique 
includes functionality for finding durations of execution that have a direct impact on 
the user. This functionality gives valid readings regardless of which algorithm is in 
control and was thus used to focus the measurements. Once the execution range for 
an interactive episode had been isolated, the full-speed equivalent work done during 
the episode was computed for both LongRun and the present technique. Since LongRun 
is in control of the CPU speed during the measurement, and it runs the processor 
faster than the present technique would if it were in control, the episode duration of 
the results corresponding to the present technique must be lengthened. First, the 
remaining work is computed for the present technique according to the following formula: 

Work_{present technique, remaining} = Work_{LongRun} - Work_{present technique} 

Next, the extent to which the length of the interactive episode needed to be 
stretched was computed, assuming that the algorithm of the present technique 
continued to run at its predicted speed until it reached the panic threshold, and ran 
at full speed after that. The statistics were adjusted accordingly. It was found that 
the results using this technique were close to those observed on similar workloads 
(same benchmark but with a slightly different interactive load) running with the 
algorithm according to the present technique in active control of the processor. 
However, when the algorithm according to the present technique was actually in 
control, the number of performance-setting decisions was reduced and the 
performance-levels were more accurate. 
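
The time-skew correction described above can be sketched as follows. This is a 
minimal illustration under stated assumptions, not the actual post-processing code: 
traces are represented as hypothetical lists of (duration, performance fraction) 
segments, work is measured in seconds of full-speed execution, and the panic 
threshold is assumed to be measured from the start of the episode. All function and 
parameter names are invented for the example. 

```python
def full_speed_work(trace):
    """Full-speed equivalent work of a trace.

    trace is a list of (duration_seconds, performance_fraction) segments,
    where performance_fraction is the speed relative to peak (0..1).
    The result is expressed in seconds of full-speed execution.
    """
    return sum(duration * fraction for duration, fraction in trace)


def stretched_duration(longrun_trace, present_trace, predicted_fraction,
                       panic_threshold):
    """Estimate the episode length had the present technique been in control.

    Assumptions (hypothetical): both traces cover the same measured episode,
    and panic_threshold is measured in seconds from the episode start.
    """
    if predicted_fraction <= 0:
        raise ValueError("predicted_fraction must be positive")
    measured = sum(duration for duration, _ in longrun_trace)
    remaining = full_speed_work(longrun_trace) - full_speed_work(present_trace)
    if remaining <= 0:
        return measured
    # Time left at the predicted speed before the panic threshold is reached.
    slow_time = max(0.0, panic_threshold - measured)
    slow_work = slow_time * predicted_fraction
    if remaining <= slow_work:
        return measured + remaining / predicted_fraction
    # Past the panic threshold the rest of the work runs at full speed.
    return measured + slow_time + (remaining - slow_work)


longrun = [(0.2, 1.0)]   # hypothetical: 0.2 s at full speed
present = [(0.2, 0.5)]   # hypothetical: same 0.2 s at half speed
print(round(stretched_duration(longrun, present, 0.5, 0.3), 3))  # 0.35
```

In the example, 0.1 s of full-speed work remains; half of it is done at the 
predicted half speed before the assumed 0.3 s panic threshold and the rest at full 
speed, giving a stretched episode of about 0.35 s. 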

Figure 13 shows the statistics gathered using the above-described time-skew 
correction technique. Each of the six graphs in the figure comprises two stacked 
columns. The left-hand column on each graph relates to LongRun whereas the 
right-hand column relates to the present technique. Each column is stacked so as to 
represent the fraction of time spent in interactive episodes at each of the four 
performance levels supported in the computer. These performance levels, from the 
bottom up, run from 300 MHz to 600 MHz in 100 MHz increments. Even from a high 
level, it is apparent that the algorithm according to the present technique spends more 
time at lower performance levels than LongRun does. On some benchmarks such as 
Emacs, there is hardly ever a need to go fast and the interactive deadlines are met 
while the machine stays at its lowest possible performance level. At the other end of 
the spectrum is the Acrobat Reader benchmark, which exhibits bimodal behaviour: 
the processor either runs at its peak level or at its minimum. Even on this benchmark 
many of the interactive episodes can complete in time at the minimum performance 
level of the processor. However, when it comes to rendering the pages, the peak 
performance level of the processor is not sufficient to meet its deadlines within 
the user perception threshold. Thus, upon encountering a sufficiently long interactive 
episode, the algorithm according to the present technique switches the processor 
performance level to its peak. By way of contrast, during the run of the Konqueror 
benchmark, the algorithm according to the present technique can take advantage of 
all four available performance levels of the processor. This can be compared with the 
LongRun strategy, which causes the processor to spend most of its time at the peak 
level. 

Overall, the simulation results detailed above with reference to Figures 8 to 
13 have shown how two performance-setting policies implemented at different levels 
in the software hierarchy behave on a variety of multimedia and interactive 
workloads. It was found that the Transmeta LongRun power manager, which is 
implemented in the processor's firmware, makes more conservative choices than the 
algorithm according to the present technique, which is implemented in the kernel of 
the operating system. On a set of multimedia benchmarks an 11% to 35% average 
performance level reduction was achieved by the algorithm according to the present 
technique over that achieved using the known LongRun technique. 

Since the performance-setting algorithm according to the present technique is 
implemented higher in the software stack than LongRun, it is able to make decisions 
based on a richer set of run-time information, which in turn translates into increased 
accuracy. 

Although the firmware approach of LongRun was shown to be less accurate than an 
algorithm implemented in the kernel, this does not diminish its usefulness. LongRun 
has the crucial advantage of being operating system agnostic. It is recognised that 
the gap between low and high level implementations could be bridged by providing a 
baseline performance-setting algorithm such as LongRun in firmware and exposing an 
interface to the operating system for the purpose of (optionally) refining processor 
performance-setting decisions. The hierarchy of performance-setting algorithms 
according to the present technique provides a mechanism to support such a design. The 
bottom-most performance-setting policy on the stack could actually be implemented 
in the firmware of the processor. 

Although illustrative embodiments of the invention have been described in detail 
herein with reference to the accompanying drawings, it is to be understood that the 
invention is not limited to those precise embodiments, and that various changes and 
modifications can be effected therein by one skilled in the art without departing from 
the scope and spirit of the invention as defined by the appended claims. 


